Abstract

An object recognition system has been developed that uses a
new class of local image features. The features are invariant
to image scaling, translation, and rotation, and partially
invariant to illumination changes and affine or 3D projection.
These features share similar properties with neurons in
inferior temporal cortex that are used for object recognition
in primate vision. Features are efficiently detected through
a staged filtering approach that identifies stable points in
scale space. Image keys are created that allow for local
geometric deformations by representing blurred image
gradients in multiple orientation planes and at multiple scales.
The keys are used as input to a nearest-neighbor indexing
method that identifies candidate object matches. Final
verification of each match is achieved by finding a low-residual
least-squares solution for the unknown model parameters.
Experimental results show that robust object recognition
can be achieved in cluttered partially-occluded images with
a computation time of under 2 seconds.

1. Introduction 

Object recognition in cluttered real-world scenes requires
local image features that are unaffected by nearby clutter or
partial occlusion. The features must be at least partially
invariant to illumination, 3D projective transforms, and
common object variations. On the other hand, the features must
also be sufficiently distinctive to identify specific objects
among many alternatives. The difficulty of the object
recognition problem is due in large part to the lack of success in
finding such image features. However, recent research on
the use of dense local features (e.g., Schmid & Mohr [19])
has shown that efficient recognition can often be achieved
by using local image descriptors sampled at a large number
of repeatable locations.

This paper presents a new method for image feature
generation called the Scale Invariant Feature Transform (SIFT).
This approach transforms an image into a large collection
of local feature vectors, each of which is invariant to image
translation, scaling, and rotation, and partially invariant to
illumination changes and affine or 3D projection. Previous
approaches to local feature generation lacked invariance to
scale and were more sensitive to projective distortion and
illumination change. The SIFT features share a number of
properties in common with the responses of neurons in
inferior temporal (IT) cortex in primate vision. This paper also
describes improved approaches to indexing and model
verification.

The scale-invariant features are efficiently identified by
using a staged filtering approach. The first stage identifies
key locations in scale space by looking for locations that
are maxima or minima of a difference-of-Gaussian function.
Each point is used to generate a feature vector that describes
the local image region sampled relative to its scale-space
coordinate frame. The features achieve partial invariance to
local variations, such as affine or 3D projections, by
blurring image gradient locations. This approach is based on a
model of the behavior of complex cells in the cerebral
cortex of mammalian vision. The resulting feature vectors are
called SIFT keys. In the current implementation, each
image generates on the order of 1000 SIFT keys, a process that
requires less than 1 second of computation time.

The SIFT keys derived from an image are used in a
nearest-neighbour approach to indexing to identify
candidate object models. Collections of keys that agree on a
potential model pose are first identified through a Hough
transform hash table, and then through a least-squares fit to a final
estimate of model parameters. When at least 3 keys agree
on the model parameters with low residual, there is strong
evidence for the presence of the object. Since there may be
dozens of SIFT keys in the image of a typical object, it is
possible to have substantial levels of occlusion in the image
and yet retain high levels of reliability.

The current object models are represented as 2D loca- 
tions of SIFT keys that can undergo affine projection. Suf- 
ficient variation in feature location is allowed to recognize 
perspective projection of planar shapes at up to a 60 degree 
rotation away from the camera or to allow up to a 20 degree 
rotation of a 3D object. 
2. Related research

Object recognition is widely used in the machine vision
industry for the purposes of inspection, registration, and
manipulation. However, current commercial systems for object
recognition depend almost exclusively on correlation-based 
template matching. While very effective for certain engi- 
neered environments, where object pose and illumination 
are tightly controlled, template matching becomes computa- 
tionally infeasible when object rotation, scale, illumination, 
and 3D pose are allowed to vary, and even more so when 
dealing with partial visibility and large model databases. 

An alternative to searching all image locations for
matches is to extract features from the image that are at
least partially invariant to the image formation process and
to match only against those features. Many candidate feature
types have been proposed and explored, including line seg- 
ments [6], groupings of edges [11, 14], and regions [2], 
among many other proposals. While these features have 
worked well for certain object classes, they are often not de- 
tected frequently enough or with sufficient stability to form 
a basis for reliable recognition. 

There has been recent work on developing much denser
collections of image features. One approach has been to
use a corner detector (more accurately, a detector of peaks
in local image variation) to identify repeatable image
locations, around which local image properties can be measured.
Zhang et al. [23] used the Harris corner detector to
identify feature locations for epipolar alignment of images taken
from differing viewpoints. Rather than attempting to
correlate regions from one image against all possible regions
in a second image, large savings in computation time were
achieved by only matching regions centered at corner points
in each image.

For the object recognition problem, Schmid & Mohr 
[19] also used the Harris corner detector to identify in- 
terest points, and then created a local image descriptor at 
each interest point from an orientation-invariant vector of 
derivative-of-Gaussian image measurements. These image 
descriptors were used for robust object recognition by look- 
ing for multiple matching descriptors that satisfied object- 
based orientation and location constraints. This work was 
impressive both for the speed of recognition in a large 
database and the ability to handle cluttered images. 

The corner detectors used in these previous approaches 
have a major failing, which is that they examine an image 
at only a single scale. As the change in scale becomes sig- 
nificant, these detectors respond to different image points. 
Also, since the detector does not provide an indication of the 
object scale, it is necessary to create image descriptors and 
attempt matching at a large number of scales. This paper de- 
scribes an efficient method to identify stable key locations 
in scale space. This means that different scalings of an im- 
age will have no effect on the set of key locations selected. 



Furthermore, an explicit scale is determined for each point, 
which allows the image description vector for that point to 
be sampled at an equivalent scale in each image. A canoni- 
cal orientation is determined at each location, so that match- 
ing can be performed relative to a consistent local 2D co- 
ordinate frame. This allows for the use of more distinctive 
image descriptors than the rotation-invariant ones used by 
Schmid and Mohr, and the descriptor is further modified to 
improve its stability to changes in affine projection and illu- 
mination. 

Other approaches to appearance-based recognition in- 
clude eigenspace matching [13], color histograms [20], and 
receptive field histograms [18]. These approaches have all 
been demonstrated successfully on isolated objects or pre- 
segmented images, but due to their more global features it 
has been difficult to extend them to cluttered and partially 
occluded images. Ohba & Ikeuchi [15] successfully apply 
the eigenspace approach to cluttered images by using many 
small local eigen-windows, but this then requires expensive
search for each window in a new image, as with template 
matching. 

3. Key localization 

We wish to identify locations in image scale space that are 
invariant with respect to image translation, scaling, and ro- 
tation, and are minimally affected by noise and small dis- 
tortions. Lindeberg [8] has shown that under some rather 
general assumptions on scale invariance, the Gaussian ker- 
nel and its derivatives are the only possible smoothing ker- 
nels for scale space analysis. 

To achieve rotation invariance and a high level of effi- 
ciency, we have chosen to select key locations at maxima 
and minima of a difference of Gaussian function applied in 
scale space. This can be computed very efficiently by build- 
ing an image pyramid with resampling between each level. 
Furthermore, it locates key points at regions and scales of 
high variation, making these locations particularly stable for 
characterizing the image. Crowley & Parker [4] and Linde- 
berg [9] have previously used the difference-of-Gaussian in 
scale space for other purposes. In the following, we describe 
a particularly efficient and stable method to detect and char- 
acterize the maxima and minima of this function. 

As the 2D Gaussian function is separable, its convolution
with the input image can be efficiently computed by
applying two passes of the 1D Gaussian function in the horizontal
and vertical directions:

$$g(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-x^2 / 2\sigma^2}$$

For key localization, all smoothing operations are done
using $\sigma = \sqrt{2}$, which can be approximated with sufficient
accuracy using a 1D kernel with 7 sample points.
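The two-pass smoothing described above can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the image is a list of rows of floats, the 7 kernel taps are renormalized to sum to 1, and edge pixels are replicated at the border (the text does not say how borders are handled).

```python
import math

def gaussian_kernel_1d(sigma, taps=7):
    """Sample a 1D Gaussian at integer offsets and normalize the taps to sum to 1."""
    half = taps // 2
    weights = [math.exp(-(x * x) / (2.0 * sigma * sigma))
               for x in range(-half, half + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def convolve_rows(image, kernel):
    """Convolve every row with the kernel; border pixels are replicated (assumption)."""
    half = len(kernel) // 2
    out = []
    for row in image:
        n = len(row)
        new_row = []
        for j in range(n):
            acc = 0.0
            for k, w in enumerate(kernel):
                jj = min(max(j + k - half, 0), n - 1)
                acc += w * row[jj]
            new_row.append(acc)
        out.append(new_row)
    return out

def transpose(image):
    return [list(col) for col in zip(*image)]

def gaussian_smooth(image, sigma=math.sqrt(2.0), taps=7):
    """Separable smoothing: one horizontal pass, then one vertical pass."""
    kernel = gaussian_kernel_1d(sigma, taps)
    horizontal = convolve_rows(image, kernel)
    return transpose(convolve_rows(transpose(horizontal), kernel))
```

Because the kernel is symmetric, the two 1D passes give the same result as a single 2D convolution, at a cost that grows with the kernel width rather than its square.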
The input image is first convolved with the Gaussian
function using $\sigma = \sqrt{2}$ to give an image A. This is then
repeated a second time with a further incremental
smoothing of $\sigma = \sqrt{2}$ to give a new image, B, which now has an
effective smoothing of $\sigma = 2$. The difference of Gaussian
function is obtained by subtracting image B from A,
resulting in a ratio of $2/\sqrt{2} = \sqrt{2}$ between the two Gaussians.
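The arithmetic behind the effective smoothing can be checked with a short sketch. It relies on the standard fact that successive Gaussian convolutions add in variance; the function names are hypothetical, and the subtraction follows the stated order (B subtracted from A).

```python
import math

def combined_sigma(*sigmas):
    """Successive Gaussian smoothings add in variance, not in sigma."""
    return math.sqrt(sum(s * s for s in sigmas))

def difference_of_gaussian(image_a, image_b):
    """Pixelwise A - B for two equally sized images given as lists of rows."""
    return [[a - b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(image_a, image_b)]

sigma_a = math.sqrt(2.0)                      # smoothing of image A
sigma_b = combined_sigma(sigma_a, sigma_a)    # effective smoothing of image B
ratio = sigma_b / sigma_a                     # ratio between the two Gaussians
```

Here `sigma_b` evaluates to 2 and `ratio` to the square root of 2, up to floating-point rounding.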

To generate the next pyramid level, we resample the
already smoothed image B using bilinear interpolation with a
pixel spacing of 1.5 in each direction. While it may seem
more natural to resample with a relative scale of $\sqrt{2}$, the
only constraint is that sampling be frequent enough to
detect peaks. The 1.5 spacing means that each new sample will
be a constant linear combination of 4 adjacent pixels. This
is efficient to compute and minimizes aliasing artifacts that
would arise from changing the resampling coefficients.
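A minimal sketch of the 1.5-spacing bilinear resampling is given below. The sampling grid is an assumption (samples start at pixel 0 and fall at 0, 1.5, 3.0, ...), as is the function name. With this grid the fractional offsets are always 0 or 0.5, so the interpolation weights come from a small fixed set rather than varying from sample to sample.

```python
def resample_bilinear(image, spacing=1.5):
    """Resample a list-of-rows image at positions 0, spacing, 2*spacing, ..."""
    rows, cols = len(image), len(image[0])
    out_rows = int((rows - 1) / spacing) + 1
    out_cols = int((cols - 1) / spacing) + 1
    result = []
    for a in range(out_rows):
        y = a * spacing
        i0 = int(y)
        i1 = min(i0 + 1, rows - 1)
        fy = y - i0                      # 0.0 or 0.5 when spacing is 1.5
        row = []
        for b in range(out_cols):
            x = b * spacing
            j0 = int(x)
            j1 = min(j0 + 1, cols - 1)
            fx = x - j0
            top = (1.0 - fx) * image[i0][j0] + fx * image[i0][j1]
            bottom = (1.0 - fx) * image[i1][j0] + fx * image[i1][j1]
            row.append((1.0 - fy) * top + fy * bottom)
        result.append(row)
    return result
```

Applying it to a 7 x 7 image yields a 5 x 5 image, since positions 0 through 6 are sampled every 1.5 pixels.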

Maxima and minima of this scale-space function are
determined by comparing each pixel in the pyramid to its
neighbours. First, a pixel is compared to its 8 neighbours at
the same level of the pyramid. If it is a maximum or minimum
at this level, then the closest pixel location is calculated at
the next lowest level of the pyramid, taking account of the
1.5 times resampling. If the pixel remains higher (or lower)
than this closest pixel and its 8 neighbours, then the test is
repeated for the level above. Since most pixels will be
eliminated within a few comparisons, the cost of this detection is
small and much lower than that of building the pyramid.
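The staged comparison can be sketched as below. Several details are assumptions made for illustration: the pyramid is a list of difference-of-Gaussian levels ordered from fine to coarse, adjacent levels differ by the 1.5 resampling factor, coordinates are mapped by simple scaling and rounding, and the helper names are hypothetical. The early returns mirror the point that most pixels are rejected after a few comparisons.

```python
def block_values(level, i, j, include_center):
    """Values in the 3x3 block around (i, j), clipped at the image border."""
    rows, cols = len(level), len(level[0])
    values = []
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            if di == 0 and dj == 0 and not include_center:
                continue
            ii, jj = i + di, j + dj
            if 0 <= ii < rows and 0 <= jj < cols:
                values.append(level[ii][jj])
    return values

def extremum_sign(value, others):
    """+1 if value is above all others, -1 if below all, 0 otherwise."""
    if all(value > v for v in others):
        return 1
    if all(value < v for v in others):
        return -1
    return 0

def closest_pixel(i, j, scale):
    """Map pixel coordinates to the nearest pixel of an adjacent level."""
    return int(round(i * scale)), int(round(j * scale))

def is_key(pyramid, k, i, j, spacing=1.5):
    """Test pixel (i, j) of level k against its own level, then adjacent levels."""
    value = pyramid[k][i][j]
    sign = extremum_sign(value, block_values(pyramid[k], i, j, False))
    if sign == 0:
        return False                      # rejected after the first 8 comparisons
    for other, scale in ((k + 1, 1.0 / spacing), (k - 1, spacing)):
        if 0 <= other < len(pyramid):
            level = pyramid[other]
            ci, cj = closest_pixel(i, j, scale)
            ci = min(ci, len(level) - 1)
            cj = min(cj, len(level[0]) - 1)
            if extremum_sign(value, block_values(level, ci, cj, True)) != sign:
                return False
    return True
```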

If the first level of the pyramid is sampled at the same rate
as the input image, the highest spatial frequencies will be
ignored. This is due to the initial smoothing, which is needed
to provide separation of peaks for robust detection.
Therefore, we expand the input image by a factor of 2, using
bilinear interpolation, prior to building the pyramid. This gives
on the order of 1000 key points for a typical 512 x 512 pixel
image, compared to only a quarter as many without the
initial expansion.

3.1. SIFT key stability

To characterize the image at each key location, the smoothed
image A at each level of the pyramid is processed to extract
image gradients and orientations. At each pixel, $A_{ij}$, the
image gradient magnitude, $M_{ij}$, and orientation, $R_{ij}$, are
computed using pixel differences:

$$M_{ij} = \sqrt{(A_{ij} - A_{i+1,j})^2 + (A_{ij} - A_{i,j+1})^2}$$

$$R_{ij} = \operatorname{atan2}(A_{ij} - A_{i+1,j},\; A_{i,j+1} - A_{ij})$$

The pixel differences are efficient to compute and provide 
sufficient accuracy due to the substantial level of previous 
smoothing. The effective half-pixel shift in position is com- 
pensated for when determining key location. 
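The pixel-difference gradient can be written directly from the two formulas. In this sketch the image `A` is a list of rows of floats and the function names are hypothetical; the last row and column are dropped because each difference needs the next pixel.

```python
import math

def gradient_at(A, i, j):
    """Gradient magnitude and orientation at pixel (i, j) from pixel differences."""
    d_i = A[i][j] - A[i + 1][j]        # A_ij - A_(i+1)j
    d_j = A[i][j + 1] - A[i][j]        # A_i(j+1) - A_ij
    magnitude = math.hypot(d_i, d_j)
    orientation = math.atan2(d_i, d_j)
    return magnitude, orientation

def gradient_maps(A):
    """Magnitude map M and orientation map R for every pixel with both neighbours."""
    rows, cols = len(A) - 1, len(A[0]) - 1
    M = [[0.0] * cols for _ in range(rows)]
    R = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            M[i][j], R[i][j] = gradient_at(A, i, j)
    return M, R
```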

Robustness to illumination change is enhanced by
thresholding the gradient magnitudes at a value of 0.1 times the
maximum possible gradient value. This reduces the effect
of a change in illumination direction for a surface with 3D
relief, as an illumination change may result in large changes
to gradient magnitude but is likely to have less influence on
gradient orientation.

Figure 1: The second image was generated from the first by
rotation, scaling, stretching, change of brightness and
contrast, and addition of pixel noise. In spite of these changes,
78% of the keys from the first image have a closely
matching key in the second image. These examples show only a
subset of the keys to reduce clutter.

Each key location is assigned a canonical orientation so
that the image descriptors are invariant to rotation. In
order to make this as stable as possible against lighting or
contrast changes, the orientation is determined by the peak in a
histogram of local image gradient orientations. The
orientation histogram is created using a Gaussian-weighted
window with $\sigma$ of 3 times that of the current smoothing scale.
These weights are multiplied by the thresholded gradient
values and accumulated in the histogram at locations
corresponding to the orientation, $R_{ij}$. The histogram has 36 bins
covering the 360 degree range of rotations, and is smoothed
prior to peak selection.
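A possible sketch of the orientation assignment follows. Details the text leaves open are filled with labelled assumptions: the threshold is applied as a ceiling on the magnitude, the maximum possible gradient is taken as the square root of 2 (intensities in the range 0 to 1), the window is cut off at twice its sigma, and the histogram is smoothed with a circular [0.25, 0.5, 0.25] filter. The function name and argument layout are hypothetical.

```python
import math

def canonical_orientation(mag, ori, ki, kj, smoothing_sigma=math.sqrt(2.0),
                          max_gradient=math.sqrt(2.0), bins=36):
    """Return the centre angle (radians) of the peak orientation-histogram bin."""
    window_sigma = 3.0 * smoothing_sigma           # 3 times the smoothing scale
    radius = int(round(2.0 * window_sigma))        # window cut-off (assumption)
    ceiling = 0.1 * max_gradient                   # magnitude threshold
    hist = [0.0] * bins
    for i in range(max(0, ki - radius), min(len(mag), ki + radius + 1)):
        for j in range(max(0, kj - radius), min(len(mag[0]), kj + radius + 1)):
            d2 = (i - ki) ** 2 + (j - kj) ** 2
            weight = math.exp(-d2 / (2.0 * window_sigma ** 2))
            angle = ori[i][j] % (2.0 * math.pi)
            b = int(angle / (2.0 * math.pi) * bins) % bins
            hist[b] += weight * min(mag[i][j], ceiling)
    # Circular smoothing; hist[b - 1] wraps around for b == 0.
    smoothed = [0.25 * hist[b - 1] + 0.5 * hist[b] + 0.25 * hist[(b + 1) % bins]
                for b in range(bins)]
    peak = max(range(bins), key=lambda b: smoothed[b])
    return (peak + 0.5) * 2.0 * math.pi / bins
```

The ceiling on the magnitude means that a single very strong edge cannot dominate the histogram, which is the intended effect of the thresholding.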

The stability of the resulting keys can be tested by
subjecting natural images to affine projection, contrast and
brightness changes, and addition of noise. The location of
each key detected in the first image can be predicted in the
transformed image from knowledge of the transform
parameters. This framework was used to select the various
sampling and smoothing parameters given above, so that
maximum efficiency could be obtained while retaining stability
to changes.

[Table with columns "Image transformation", "Match %", and
"Ori %"; only the row label "H. All of A,B,C,D,E,G." survives,
without its values.]

Figure 2: For various image transformations applied to a
sample of 20 images, this table gives the percent of keys that
are found at matching locations and scales (Match %) and
that also match in orientation (Ori %).

Figure 1 shows a relatively small number of keys
detected over a 2 octave range of only the larger scales (to
avoid excessive clutter). Each key is shown as a square, with
a line from the center to one side of the square indicating
orientation. In the second half of this figure, the image is
rotated by 15 degrees, scaled by a factor of 0.9, and stretched
by a factor of 1.1 in the horizontal direction. The pixel
intensities, in the range of 0 to 1, have 0.1 subtracted from their
brightness values and the contrast reduced by multiplication
by 0.9. Random pixel noise is then added to give less than
5 bits/pixel of signal. In spite of these transformations, 78%
of the keys in the first image had closely matching keys in
the second image at the predicted locations, scales, and
orientations.

The overall stability of the keys to image transformations
can be judged from the table in Figure 2. Each entry in this
table is generated from combining the results of 20 diverse
test images and summarizes the matching of about 15,000 keys.
Each line of the table shows a particular image transformation.
The first column gives the percent of keys that have a
matching key in the transformed image within $\sigma$ in location
(relative to scale for that key) and a factor of 1.5 in scale. The
second column gives the percent that match these criteria as
well as having an orientation within 20 degrees of the
prediction.
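The two match criteria can be expressed as a small test. This sketch assumes each key is a tuple (x, y, scale, orientation in degrees), that the location tolerance equals the predicted key's scale, and that the scale test is symmetric (ratio between 1/1.5 and 1.5); the names are hypothetical.

```python
import math

def angle_difference(a_deg, b_deg):
    """Smallest absolute difference between two angles, in degrees."""
    return abs((a_deg - b_deg + 180.0) % 360.0 - 180.0)

def keys_match(predicted, detected, check_orientation=False):
    """Apply the location and scale criteria, optionally the orientation one."""
    px, py, p_scale, p_ori = predicted
    dx, dy, d_scale, d_ori = detected
    location_ok = math.hypot(dx - px, dy - py) <= p_scale
    scale_ok = 1.0 / 1.5 <= d_scale / p_scale <= 1.5
    if not (location_ok and scale_ok):
        return False
    if check_orientation:
        return angle_difference(d_ori, p_ori) <= 20.0
    return True

def match_percentages(pairs):
    """Return (location-and-scale %, plus-orientation %) over (predicted, detected) pairs."""
    n = len(pairs)
    matched = sum(keys_match(p, d) for p, d in pairs)
    oriented = sum(keys_match(p, d, check_orientation=True) for p, d in pairs)
    return 100.0 * matched / n, 100.0 * oriented / n
```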

4. Local image description 

Given a stable location, scale, and orientation for each key, it 
is now possible to describe the local image region in a man- 
ner invariant to these transformations. In addition, it is desir- 
able to make this representation robust against small shifts 
in local geometry, such as arise from affine or 3D projection. 



One approach to this is suggested by the response properties 
of complex neurons in the visual cortex, in which a feature 
position is allowed to vary over a small region while orienta- 
tion and spatial frequency specificity are maintained. Edel- 
man, Intrator & Poggio [5] have performed experiments that 
simulated the responses of complex neurons to different 3D 
views of computer graphic models, and found that the com- 
plex cell outputs provided much better discrimination than 
simple correlation-based matching. This can be seen, for ex- 
ample, if an affine projection stretches an image in one di- 
rection relative to another, which changes the relative loca- 
tions of gradient features while having a smaller effect on 
their orientations and spatial frequencies. 

This robustness to local geometric distortion can be ob- 
tained by representing the local image region with multiple 
images representing each of a number of orientations (re- 
ferred to as orientation planes). Each orientation plane con- 
tains only the gradients corresponding to that orientation, 
with linear interpolation used for intermediate orientations. 
Each orientation plane is blurred and resampled to allow for
larger shifts in positions of the gradients. 

This approach can be efficiently implemented by using 
the same precomputed gradients and orientations for each 
level of the pyramid that were used for orientation selection. 
For each keypoint, we use the pixel sampling from the pyra- 
mid level at which the key was detected. The pixels that fall 
in a circle of radius 8 pixels around the key location are in- 
serted into the orientation planes. The orientation is mea- 
sured relative to that of the key by subtracting the key's ori- 
entation. For our experiments we used 8 orientation planes, 
each sampled over a 4 x 4 grid of locations, with a sample
spacing 4 times that of the pixel spacing used for gradient
detection. The blurring is achieved by allocating the gradi- 
ent of each pixel among its 8 closest neighbors in the sam- 
ple grid, using linear interpolation in orientation and the two 
spatial dimensions. This implementation is much more effi- 
cient than performing explicit blurring and resampling, yet 
gives almost equivalent results. 
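The allocation of each gradient among its 8 closest neighbours can be sketched as trilinear interpolation over (orientation plane, grid row, grid column). The grid placement is an assumption (the sample grid is centred on the key), contributions falling outside the grid are simply dropped, only the orientation subtraction from the text is modelled, and all names are hypothetical.

```python
import math

N_ORI = 8        # orientation planes
SPACING = 4.0    # sample spacing, in pixels of the gradient maps

def new_planes(grid):
    return [[[0.0] * grid for _ in range(grid)] for _ in range(N_ORI)]

def add_gradient(planes, u, v, angle, magnitude):
    """Spread one gradient over the 2 x 2 x 2 nearest (orientation, row, col) cells."""
    n_rows, n_cols = len(planes[0]), len(planes[0][0])
    o = (angle % (2.0 * math.pi)) / (2.0 * math.pi) * N_ORI
    o0, r0, c0 = int(math.floor(o)), int(math.floor(u)), int(math.floor(v))
    fo, fr, fc = o - o0, u - r0, v - c0
    for d_o, w_o in ((0, 1.0 - fo), (1, fo)):
        oi = (o0 + d_o) % N_ORI                   # orientation wraps around
        for d_r, w_r in ((0, 1.0 - fr), (1, fr)):
            ri = r0 + d_r
            if not 0 <= ri < n_rows:
                continue
            for d_c, w_c in ((0, 1.0 - fc), (1, fc)):
                ci = c0 + d_c
                if 0 <= ci < n_cols:
                    planes[oi][ri][ci] += magnitude * w_o * w_r * w_c

def describe(samples, key_orientation, grid=4, radius=8.0):
    """samples: (di, dj, magnitude, orientation) with pixel offsets from the key."""
    planes = new_planes(grid)
    centre = (grid - 1) / 2.0
    for di, dj, mag, ori in samples:
        if di * di + dj * dj > radius * radius:
            continue                              # outside the circle of radius 8
        add_gradient(planes, di / SPACING + centre, dj / SPACING + centre,
                     ori - key_orientation, mag)
    return [value for plane in planes for row in plane for value in row]
```

With `grid=4` the flattened vector has 8 x 4 x 4 = 128 entries, and with `grid=2` it has 32.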

In order to sample the image at a larger scale, the same 
process is repeated for a second level of the pyramid one oc- 
tave higher. However, this time a 2 x 2 rather than a 4 x 4 
sample region is used. This means that approximately the 
same image region will be examined at both scales, so that 
any nearby occlusions will not affect one scale more than the 
other. Therefore, the total number of samples in the SIFT
key vector, from both scales, is 8 x 4 x 4 + 8 x 2 x 2, or 160
elements, giving enough measurements for high specificity. 

5. Indexing and matching 

For indexing, we need to store the SIFT keys for sample
images and then identify matching keys from new images. The
problem of identifying the most similar keys for high
dimensional vectors is known to have high complexity if an
exact solution is required. However, a modification of the k-d
tree algorithm called the best-bin-first search method (Beis
& Lowe [3]) can identify the nearest neighbors with high
probability using only a limited amount of computation. To
further improve the efficiency of the best-bin-first algorithm,
the SIFT key samples generated at the larger scale are given
twice the weight of those at the smaller scale. This means
that the larger scale is in effect able to filter the most likely
neighbours for checking at the smaller scale. This also
improves recognition performance by giving more weight to
the least-noisy scale. In our experiments, it is possible to
have a cut-off [...] in a probabilistic best-bin-first search of
30,000 key vectors with almost no loss of performance
compared to finding an exact solution.

An efficient way to cluster reliable model hypotheses
is to use the Hough transform [1] to search for keys that
agree upon a particular model pose. Each model key in the
database contains a record of the key's parameters relative
to the model coordinate system. Therefore, we can create
an entry in a hash table predicting the model location,
orientation, and scale from the match hypothesis. We use a
bin size of 30 degrees for orientation, [...]
and 0.25 times the maximum model dimension for location.
These rather broad bin sizes allow for clustering even in the
presence of substantial geometric distortion, such as due to a
change in 3D viewpoint. To avoid the problem of boundary
effects in hashing, each hypothesis is hashed into the 2
closest bins in each dimension, giving a total of 16 hash table
entries for each hypothesis.
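The hashing of one hypothesis into 16 entries can be sketched as follows. The four pose dimensions are taken to be orientation, scale, and the two location coordinates. The scale bin size is not legible in the text, so `scale_factor=2.0` is only a placeholder assumption; the function names and the dictionary-based table are likewise hypothetical.

```python
import math
from itertools import product

def two_closest_bins(value, bin_size, wrap=None):
    """Index of the bin containing value, plus the nearer of its two neighbours."""
    pos = value / bin_size
    nearest = int(math.floor(pos))
    other = nearest + 1 if pos - nearest >= 0.5 else nearest - 1
    bins = (nearest, other)
    if wrap is not None:
        bins = tuple(b % wrap for b in bins)      # circular dimension
    return bins

def hash_entries(orientation_deg, scale, x, y, max_model_dim,
                 ori_bin=30.0, scale_factor=2.0, loc_fraction=0.25):
    """All (orientation, scale, x, y) bin tuples that one hypothesis votes for."""
    ori_bins = two_closest_bins(orientation_deg % 360.0, ori_bin,
                                wrap=int(round(360.0 / ori_bin)))
    # Scale is binned logarithmically; the bin factor is an assumed placeholder.
    scale_bins = two_closest_bins(math.log(scale, scale_factor), 1.0)
    loc_bin = loc_fraction * max_model_dim
    x_bins = two_closest_bins(x, loc_bin)
    y_bins = two_closest_bins(y, loc_bin)
    return list(product(ori_bins, scale_bins, x_bins, y_bins))

def vote(table, model_id, entry, match):
    """Record a match under one hash entry of a given model."""
    table.setdefault((model_id,) + entry, []).append(match)
```

Two bins in each of four dimensions give 2 x 2 x 2 x 2 = 16 entries per hypothesis, which matches the count stated above.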



6. Solution for affine parameters 

The hash table is searched to identify all clusters of at least
3 entries in a bin, and the bins are sorted into decreasing
order of size. Each such cluster is then subject to a verification
procedure in which a least-squares solution is performed for
the affine projection parameters relating the model to the
image.

Figure 3: Model images of planar objects are shown in the
top row. Recognition results below show model outlines and
image keys used for matching.

Figure 4: Top row shows model images for 3D objects with
outlines found by background segmentation. Bottom image
shows recognition results for 3D objects with model outlines
and image keys used for matching.

The affine transformation of a model point $[x\ y]^T$ to an
image point $[u\ v]^T$ can be written as

$$\begin{bmatrix} u \\ v \end{bmatrix} = \begin{bmatrix} m_1 & m_2 \\ m_3 & m_4 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} + \begin{bmatrix} t_x \\ t_y \end{bmatrix}$$

where the model translation is $[t_x\ t_y]^T$ and the affine
rotation, scale, and stretch are represented by the $m_i$ parameters.

We wish to solve for the transformation parameters, so
the equation above can be rewritten as

$$\begin{bmatrix} x & y & 0 & 0 & 1 & 0 \\ 0 & 0 & x & y & 0 & 1 \\ & & \cdots & & & \end{bmatrix} \begin{bmatrix} m_1 \\ m_2 \\ m_3 \\ m_4 \\ t_x \\ t_y \end{bmatrix} = \begin{bmatrix} u \\ v \\ \vdots \end{bmatrix}$$

This equation shows a single match, but any number of
further matches can be added, with each match contributing
two more rows to the first and last matrix. At least 3 matches
are needed to provide a solution.

We can write this linear system as

$$Ax = b$$

The least-squares solution for the parameters x can be
determined by solving the corresponding normal equations,

$$x = [A^T A]^{-1} A^T b$$

which minimizes the sum of the squares of the distances
from the projected model locations to the corresponding
image locations. This least-squares approach could readily be
extended to solving for 3D pose and internal parameters of
articulated and flexible objects [12].
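The normal-equation solution can be sketched in plain Python. The parameter ordering (four linear terms followed by the two translation terms), the function names, and the use of Gauss-Jordan elimination in place of an explicit matrix inverse are choices made for this illustration. Each match contributes one row for its image u coordinate and one for its image v coordinate.

```python
def solve_linear(M, v):
    """Solve M z = v by Gauss-Jordan elimination with partial pivoting."""
    n = len(v)
    aug = [list(M[i]) + [v[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: the matches are degenerate")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                for c in range(col, n + 1):
                    aug[r][c] -= f * aug[col][c]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def fit_affine(model_points, image_points):
    """Least-squares affine parameters [m1, m2, m3, m4, tx, ty] from >= 3 matches."""
    if len(model_points) < 3:
        raise ValueError("at least 3 matches are needed")
    A, b = [], []
    for (x, y), (u, v) in zip(model_points, image_points):
        A.append([x, y, 0.0, 0.0, 1.0, 0.0])   # row for u = m1*x + m2*y + tx
        b.append(u)
        A.append([0.0, 0.0, x, y, 0.0, 1.0])   # row for v = m3*x + m4*y + ty
        b.append(v)
    AtA = [[sum(row[i] * row[j] for row in A) for j in range(6)] for i in range(6)]
    Atb = [sum(row[i] * bi for row, bi in zip(A, b)) for i in range(6)]
    return solve_linear(AtA, Atb)
```

With exactly 3 non-collinear matches the six equations determine the six parameters exactly; additional matches make the system overdetermined, and the solution then minimizes the summed squared image distances.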

Outliers can now be removed by checking for agreement
between each image feature and the model, given the
parameter solution. Each match must agree within 15 degrees
orientation, $\sqrt{2}$ change in scale, and 0.2 times maximum model
size in terms of location. If fewer than 3 points remain after
discarding outliers, then the match is rejected. If any outliers
are discarded, the least-squares solution is re-solved with the
remaining points.
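The agreement test can be sketched as below. How the rotation and scale of the pose are read off the affine parameters is an assumption (rotation from the first column of the linear part, scale from the square root of its determinant), as is expressing the location tolerance in image units by multiplying by that scale. The parameter list is ordered as the four linear terms followed by the translation, and the match dictionary keys are hypothetical.

```python
import math

def filter_outliers(matches, params, max_model_size):
    """Return the matches that agree with the affine solution, or None if < 3 remain."""
    m1, m2, m3, m4, tx, ty = params
    rotation = math.degrees(math.atan2(m3, m1))        # assumed pose rotation
    scale = math.sqrt(abs(m1 * m4 - m2 * m3))          # assumed pose scale
    inliers = []
    for m in matches:
        x, y = m["model_xy"]
        u, v = m["image_xy"]
        pu, pv = m1 * x + m2 * y + tx, m3 * x + m4 * y + ty
        location_ok = math.hypot(u - pu, v - pv) <= 0.2 * max_model_size * scale
        ori_error = abs((m["image_ori"] - m["model_ori"] - rotation + 180.0)
                        % 360.0 - 180.0)
        ratio = m["image_scale"] / (m["model_scale"] * scale)
        scale_ok = 1.0 / math.sqrt(2.0) <= ratio <= math.sqrt(2.0)
        if location_ok and ori_error <= 15.0 and scale_ok:
            inliers.append(m)
    return inliers if len(inliers) >= 3 else None
```

If the returned list is shorter than the input, the affine parameters would be solved again on the remaining matches, as described above.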




Figure 5: Examples of 3D object recognition with occlusion.

7. Experiments 

The affine solution provides a good approximation to
perspective projection of planar objects, so planar models
provide a good initial test of the approach. The top row of
Figure 3 shows three model images of rectangular planar faces
of objects. The figure also shows a cluttered image
containing the planar objects, and the same image is shown
overlaid with the models following recognition. The model
keys that are displayed are the ones used for recognition and
final least-squares solution. Since only 3 keys are needed
for robust recognition, it can be seen that the solutions are
highly redundant and would survive substantial occlusion.
Also shown are the rectangular borders of the model images,
projected using the affine transform from the least-squares
solution. These closely agree with the true borders of the
planar regions in the image, except for small errors
introduced by the perspective projection. Similar experiments
have been performed for many images of planar objects, and
the recognition has proven to be robust to at least a 60 degree
rotation of the object in any direction away from the camera.

Although the model images and affine parameters do not
account for rotation in depth of 3D objects, they are still
sufficient to perform robust recognition of 3D objects over
about a 20 degree range of rotation in depth away from each
model view. An example of three model images is shown in
the top row of Figure 4. The models were photographed on a
black background, and object outlines extracted by
segmenting out the background region. An example of recognition is
shown in the same figure, again showing the SIFT keys used
for recognition. The object outlines are projected using the
affine parameter solution, but this time the agreement is not
as close because the solution does not account for rotation
in depth. Figure 5 shows more examples in which there is
significant partial occlusion.

Figure 6: Stability of image keys is tested under differing
illumination. The first image is illuminated from upper left
and the second from center right. Keys shown in the bottom
image were those used to match second image to first.

The images in these examples are of size 384 x 512
pixels. The computation times for recognition of all objects in
each image are about 1.5 seconds on a Sun Sparc 10
processor, with about 0.9 seconds required to build the
scale-space pyramid and identify the SIFT keys, and about 0.6
seconds to perform indexing and least-squares verification.
This does not include time to pre-process each model image,
which would be about 1 second per image, but would only
need to be done once for initial entry into a model database.

The illumination invariance of the SIFT keys is
demonstrated in Figure 6. The two images are of the same scene
from the same viewpoint, except that the first image is
illuminated from the upper left and the second from the
center right. The full recognition system is run to identify the
second image using the first image as the model, and the
second image is correctly recognized as matching the first.
Only SIFT keys that were part of the recognition are shown.
There were 273 keys that were verified as part of the final
match, which means that in each case not only was the same
key detected at the same location, but it also was the
closest match to the correct corresponding key in the second
image. Any 3 of these keys would be sufficient for recognition.
While matching keys are not found in some regions where
highlights or shadows change (for example on the shiny top
of the camera), in general the keys show good invariance to
illumination change.

8. Connections to biological vision 

The performance of human vision is obviously far superior
to that of current computer vision systems, so there is
potentially much to be gained by emulating biological processes.
Fortunately, there have been dramatic improvements within
the past few years in understanding how object recognition
is accomplished in animals and humans.

Recent research in neuroscience has shown that object
recognition in primates makes use of features of
intermediate complexity that are largely invariant to changes in scale,
location, and illumination (Tanaka [21], Perrett & Oram
[16]). Some examples of such intermediate features found
in inferior temporal cortex (IT) are neurons that respond to
a dark five sided star shape, a circle with a thin protruding
element, or a horizontal textured region within a triangular
boundary. These neurons maintain highly specific responses
to shape features that appear anywhere within a large
portion of the visual field and over a several octave range of
scales (Ito et al. [7]). The complexity of many of these
features appears to be roughly the same as for the current SIFT
features, although there are also some neurons that respond
to more complex shapes, such as faces. Many of the
neurons respond to color and texture properties in addition to
shape. The feature responses have been shown to depend
on previous visual learning from exposure to specific objects
containing the features (Logothetis, Pauls & Poggio [10]).
These features appear to be derived in the brain by a highly
computation-intensive parallel process, which is quite
different from the staged filtering approach given in this paper.
However, the results are much the same: an image is
transformed into a large set of local features that each match a
small fraction of potential objects yet are largely invariant
to common viewing transformations.

It is also known that object recognition in the brain
depends on a serial process of attention to bind features to
object interpretations, determine pose, and segment an object
from a cluttered background [22]. This process is
presumably playing the same role in verification as the parameter
solving and outlier detection used in this paper, since the
accuracy of interpretations can often depend on enforcing a
single viewpoint constraint [11].

9. Conclusions and comments 

The SIFT features improve on previous approaches by being
largely invariant to changes in scale, illumination, and local
affine distortions. The large number of features in a typical
image allows for robust recognition under partial occlusion in
cluttered images. A final stage that solves for affine model
parameters allows for more accurate verification and pose
determination than in approaches that rely only on indexing.

An important area for further research is to build models 
from multiple views that represent the 3D structure of ob- 
jects. This would have the further advantage that keys from 
multiple viewing conditions could be combined into a single 
model, thereby increasing the probability of finding matches 
in new views. The models could be true 3D representations 
based on structure-from-motion solutions, or could repre- 
sent the space of appearance in terms of automated cluster- 
ing and interpolation (Pope & Lowe [17]). An advantage of 
the latter approach is that it could also model non-rigid
deformations.

The recognition performance could be further improved 
by adding new SIFT feature types to incorporate color, tex- 
ture, and edge groupings, as well as varying feature sizes 
and offsets. Scale-invariant edge groupings that make local
figure-ground discriminations would be particularly useful
at object boundaries where background clutter can interfere
with other features. The indexing and verification framework
allows for all types of scale and rotation invariant features
to be incorporated into a single model representation.
Maximum robustness would be achieved by detecting many 
different feature types and relying on the indexing and clus- 
tering to select those that are most useful in a particular im- 
age. 